nnLB: Next Pitch
A Novel Deep Learning Approach to MLB Pitch Prediction Using In-Game Video Footage
Abstract
The importance of analytics in baseball has grown considerably in recent decades. Accordingly, Major League Baseball (MLB) organizations have invested substantial resources into the research and development of advanced statistical methods that can be leveraged to gain competitive advantages. Pitch prediction has emerged as one of these active areas of research. Here we develop a novel deep learning approach for pitch prediction. Using pose estimation time-series data from in-game video footage, we train two pitcher-specific convolutional neural networks (CNNs) to predict the pitches of Tyler Glasnow (2019 season) and Walker Buehler (2021 season). Notably, our selected model achieves a prediction accuracy of 87.1% and an area under the curve (AUC) of 0.919 on a holdout test set for Tyler Glasnow’s 2019 season. These results demonstrate the effectiveness of using in-game video footage and deep learning for pitch prediction tasks.
Introduction
Organizations across all major sports leagues have adopted data-driven decision-making approaches to remain competitive in recent decades. Among these leagues, Major League Baseball is widely recognized as the pioneer in embracing analytics. In fact, an entire domain of sports-based analytics, termed sabermetrics, is devoted to baseball-specific statistics and analysis. Consequently, a wealth of high-resolution public data and untapped opportunities exist within the world of baseball.
Baseball enthusiasts would agree that success within the sport relies heavily on the game within the game. Identifying and exploiting small advantages can yield significant returns in achieving desired outcomes. Here, we introduce a deep learning method that utilizes in-game video footage to predict pitches. This endeavor is motivated by two factors. First, a reliable pitch classifier can provide batters with an edge during live at-bats. Second, an interpretable deep learning model can give pitchers insight into how predictable they are and how they can conceal their pitches more effectively.
Methods
Defining the Sample
Prediction tasks must be segmented by pitcher, since pitchers have unique motions, tendencies, and pitching arsenals (i.e., pitchers throw different types of pitches). As such, we decided to focus our proof-of-concept analysis on two pitchers: Tyler Glasnow (2019 regular season) and Walker Buehler (2021 regular season). Further, most pitchers pitch from two separate positions (the windup and the stretch), the choice of which depends on the game situation. We decided to focus on pitches thrown from the stretch, since the motion is more compact.
Data Collection
I. Web Scraping
BaseballSavant is a website dedicated to providing the public with access to historical MLB data. These data include video footage and Statcast tabular data for every pitch thrown in the MLB since 2018 and 2015, respectively. We built web scrapers to retrieve both the video source URL and pitch type for every pitch of interest. Video source URLs were used as inputs for our feature engineering pipeline.
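The URL-retrieval step can be sketched as follows. The page URL and the `<source src="...">` HTML pattern below are illustrative assumptions, not the actual BaseballSavant markup, which may differ; this is a minimal sketch of extracting a video source URL from a pitch page.

```python
import re
from urllib.request import urlopen

# Pattern for a <source src="...mp4"> tag; the real BaseballSavant
# markup may differ -- this regex is an illustrative assumption.
VIDEO_SRC_RE = re.compile(r'<source[^>]+src="([^"]+\.mp4)"')

def extract_video_url(html: str):
    """Return the first .mp4 source URL found in a page's HTML, or None."""
    match = VIDEO_SRC_RE.search(html)
    return match.group(1) if match else None

def fetch_video_url(page_url: str):
    """Download a (hypothetical) pitch page and extract its video URL."""
    with urlopen(page_url, timeout=30) as resp:
        html = resp.read().decode("utf-8")
    return extract_video_url(html)
```

Separating parsing (`extract_video_url`) from fetching keeps the scraper testable without network access.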
II. Feature Engineering
A highlight of our work is the feature engineering process, termed the Video2Data pipeline. The pipeline works as follows. First, a video is downloaded from the source URL and converted into a series of images (or frames). Second, an object detection model is used to determine the location of the pitcher in each frame, and the coordinates reported by the model are subsequently used to blur the background of each image. Blurring is performed using OpenCV. The object detection model used in this step is a custom Detectron2 model (Faster R-CNN) that was trained on a self-annotated data set to specifically detect pitchers. This step is necessary for scalable and reliable feature extraction, since the OpenPose pose estimation software (used in the following step) detects humans non-specifically. Third, OpenPose is applied to each image to extract the coordinates of 25 keypoints on the pitcher’s body. Keypoint coordinates from each frame are finally merged into a single data structure. Example outputs generated at each step of the Video2Data pipeline are shown in Figure 1.
Figure 1. Example Video2Data pipeline outputs. (1) Video to image conversion (left). (2) Pitcher detection and background blurring (middle). (3) OpenPose pose estimation (right).
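The background-blurring step (2) can be sketched as follows. In the actual pipeline the bounding box comes from the Detectron2 pitcher detector and the blur is applied with OpenCV; here the box is supplied directly and a simple NumPy box blur stands in for OpenCV's Gaussian blur, as a minimal illustration of keeping the pitcher region sharp while degrading the background.

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Crude k x k box blur (stand-in for the OpenCV Gaussian blur
    used in the real pipeline)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Accumulate shifted copies, then average below.
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(img.dtype)

def blur_background(frame: np.ndarray, bbox) -> np.ndarray:
    """Blur everything outside the pitcher's bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    blurred = box_blur(frame)
    blurred[y0:y1, x0:x1] = frame[y0:y1, x0:x1]  # keep the pitcher sharp
    return blurred
```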
III. Data Preprocessing
Videos accessible through BaseballSavant are inherently inconsistent (i.e., in terms of video duration, camera perspective, etc.). Additionally, the Video2Data pipeline produced some undesired artifacts that could be harmful to machine learning applications (e.g., missingness and pose estimation errors). As such, we implemented a data cleaning process that both detects and removes unusable observations and prepares usable observations for modeling.
The first step of our data preprocessing method addresses the inconsistent video duration problem. To crop each video to a similar range, we use pose estimation data to identify the frame at which the pitcher’s knee reaches its \(y_{max}\), which we refer to as \(t_{y_{max}}\) (i.e., the frame at which the left knee’s \(y\)-coordinate is maximized if the pitcher is right-handed, and vice versa). We use \(t_{y_{max}}\) as a reference time point because it reliably corresponds to a common event in a pitcher’s delivery (termed the peak leg lift), irrespective of who is pitching. We then use \(t_{y_{max}}\) to determine the start and end frames, termed \(t_{start}\) and \(t_{end}\), using Equations 1-3. Briefly, \(t_{start}\) and \(t_{end}\) are computed for each observation and are used to crop each observation’s OpenPose coordinate data to a common 15 time points. An example of the frame range included for our prediction task is shown in Figure 2. Observations with \(t_{start} < 0\) are considered to have an insufficient number of frames for inclusion and are removed from consideration.
\[
\begin{aligned}
n_{frames} &= 15\\
t_{end} &= t_{y_{max}} + 2\\
t_{start} &= t_{end} - (n_{frames} - 1)
\end{aligned}
\tag{Equations 1-3}
\]
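The cropping rule in Equations 1-3 can be sketched as follows. For simplicity, the pose data are assumed here to be reduced to a per-frame array of lead-knee \(y\)-coordinates alongside the full keypoint array; the extra bounds check on \(t_{end}\) is a safety guard added for illustration, while the paper's stated removal criterion is \(t_{start} < 0\).

```python
import numpy as np

N_FRAMES = 15  # Equation 1: common window length

def crop_to_window(knee_y: np.ndarray, pose_data: np.ndarray):
    """Crop one observation to the 15-frame window anchored at peak leg lift.

    knee_y: lead-knee y-coordinate per frame (left knee for a right-handed
    pitcher, and vice versa).
    pose_data: per-frame keypoint data aligned with knee_y.
    Returns the cropped pose data, or None if the observation has too few
    frames (t_start < 0) and should be removed.
    """
    t_peak = int(np.argmax(knee_y))       # t_{y_max}: peak leg lift frame
    t_end = t_peak + 2                    # Equation 2
    t_start = t_end - (N_FRAMES - 1)      # Equation 3
    if t_start < 0 or t_end >= len(knee_y):
        return None                       # insufficient frames; drop
    return pose_data[t_start:t_end + 1]
```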